Libraries
Read the data
Description of the dataset
This network is a directed and unweighted social communication network, which represents the exchange of emails among members of the Rovira i Virgili University in Spain, in 2003.
Source: Netzschleuder.
Interpreting the email dataset as a SIR model
To help interpret the findings within the context of the university email network, we specify the following settings. In a university email network, we can consider the epidemic as the spread of rumors between individuals. For example, it could be some sort of negative campaign about a professor to ruin their reputation (but never a Social Networks teacher obviously - don’t worry Blas, you’re safe :D).
We could define the actors in the model as:
Susceptible (S): Individuals who have not yet received the email
Infected (I): Individuals who have already received the bad email and joined the campaign. They can now pass the rumor on to other individuals in the university
Recovered (R): Individuals who have already received the bad email and were infected, but now realize that it’s just a rumor. They are no longer at risk of believing the email if they see it again AND will not forward it to others. They are immune to the rumor.
Understanding the model type, we could frame the possible outcomes as:
Susceptible-Infected (SI): Everyone who received the virus is infected, loses control of their computer, and the virus can be sent on to all of their connections. (This is the least difficult to justify in this setting.)
Susceptible-Infected-Recovered (SIR): Everyone who received the email virus has installed proper security software and is not at risk of receiving any more malware.
Susceptible-Infected-Susceptible (SIS): People have IT make a temporary fix to remove the email virus, but it is not a long-term solution. The computer does not gain immunity.
Nodes and links
## # A tibble: 6 × 3
## `# index` name `_pos`
## <dbl> <dbl> <chr>
## 1 0 1 array([-7.40536013, 2.36294663])
## 2 1 2 array([-7.44612057, 2.39157152])
## 3 2 3 array([-7.3831753 , 2.37534173])
## 4 3 4 array([-7.41577001, 2.44331821])
## 5 4 5 array([-7.35737624, 2.33315794])
## 6 5 6 array([-7.37931506, 2.36491993])
## # A tibble: 6 × 2
## `# source` target
## <dbl> <dbl>
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Within the nodes dataframe, the # index column indicates the index of each node. The name column would contain the corresponding name of each node, but the names have been anonymized for privacy reasons, so this column is just another series of numbers. The _pos column represents the coordinates of each node in 2D space, used for visualization or layout in graph-related tasks.
In the links dataframe, the # source and target columns indicate the indexes of the two nodes forming a link.
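For reference, a graph like ours can be assembled directly from two such tables with igraph's graph_from_data_frame. A minimal sketch with toy stand-ins for the nodes and links tables (the names `nodes` and `links` here are illustrative, not the actual loading code):

```r
library(igraph)
# Toy stand-ins mirroring the structure of the nodes and links tables
nodes <- data.frame(index = 0:3, name = 1:4)
links <- data.frame(source = c(0, 0, 1), target = c(1, 2, 3))
# Edge table first, then the vertex table; edge endpoints must match vertex ids
g <- graph_from_data_frame(links, directed = TRUE, vertices = nodes)
vcount(g)  # 4 nodes
ecount(g)  # 3 edges
```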
Graph
## IGRAPH 69c7b7c DN-- 1133 10903 --
## + attr: name (v/n), _pos (v/c)
## + edges from 69c7b7c (vertex names):
## [1] 1-> 2 1-> 3 1-> 4 1-> 5 1-> 6 1-> 7 1-> 8 1-> 9 1->10 1->11 1->12 1->13
## [13] 1->14 1->15 1->16 1->17 1->18 1->19 1->20 1->21 1->22 1->23 1->24 1->25
## [25] 1->26 1->27 1->28 1->29 1->30 1->31 2->18 2->32 2->13 2->16 2->10 2->24
## [37] 2->33 2->34 2->22 2-> 1 2->35 2-> 3 2-> 9 2->20 2->36 2-> 7 2-> 6 2-> 8
## [49] 2->37 2->11 2-> 4 2->38 2->19 3->39 3->21 3->40 3->13 3-> 2 3-> 1 3-> 7
## [61] 3-> 8 3->34 3->28 3->23 3->41 3->42 3->43 3-> 9 3->11 3-> 4 3->18 3->19
## [73] 3->44 3->16 3->45 3->46 3-> 6 3->20 3->27 3->22 3->47 3->48 3->49 3->50
## [85] 3->12 3->51 3->52 3->53 3->54 3->55 3->56 4->57 4->58 4->59 4-> 1 4->60
## + ... omitted several edges
Initial graph exploration
To understand the graph, we can plot the degree distribution to check whether we have the expected power-law distribution or another type. This will help us interpret the findings later when we remove links.
First, we check for any loops; these would be emails sent to oneself. If any exist, we can simplify the graph to remove them (along with any multiple edges).
paste0("The number of recursive (self-directed) emails in the network is: ",
ecount(graph) - ecount(simplify(graph)))## [1] "The number of recursive (self-directed) emails in the network is: 1"
# we see that one exists, so we simplify the graph to remove it.
graph <- simplify(graph)
# check our graph is setup properly
is_igraph(graph)## [1] TRUE
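As a sanity check on what simplify removes, a toy graph with one self-loop and one duplicated edge (illustration only; `g_toy` is made up):

```r
library(igraph)
# Toy graph: one self-loop (1->1) and a duplicated edge (1->2 twice)
g_toy <- make_graph(c(1, 1, 1, 2, 1, 2), n = 2, directed = TRUE)
ecount(g_toy)            # 3 raw edges
ecount(simplify(g_toy))  # 1: the loop and the duplicate are dropped
```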
Next, we calculate basic descriptive statistics of the graph:
paste0("There are ", vcount(graph), " nodes and ", ecount(graph), " edges in our Spanish email network")## [1] "There are 1133 nodes and 10902 edges in our Spanish email network"
## [1] TRUE
#degree
degree_all <- degree(graph, mode="all")
degree_out <- degree(graph, mode="out")
paste0("The average degree is: ", round(mean(degree(graph)), 2), " when we DO NOT consider the direction of the emails.")## [1] "The average degree is: 19.24 when we DO NOT consider the direction of the emails."
paste0("The average degree is: ", round(sum(degree_out)/vcount(graph), 2), " when we DO consider the emails to be directed outward from the node (sender) to a recipient")## [1] "The average degree is: 9.62 when we DO consider the emails to be directed outward from the node (sender) to a recipient"
# most connected node:
paste0("The most connected node is number ", which.max(degree(graph))[1], " which has ", max(degree(graph)), " links.")## [1] "The most connected node is number 105 which has 142 links."
For the purposes of this homework, we are not factoring in this directional property of the graph. We are considering that the emails are indicating a social relationship between the two nodes (sender and receiver) and it is irrelevant who sent the email. This means the average interactions for each individual node is 19.24 (which would represent email interactions with 19.24 different people on average).
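The two averages are linked by a simple identity: each directed edge contributes one out-degree and one in-degree, so the total-degree average is 2m/n while the out-degree average is m/n. A quick check with our counts:

```r
n <- 1133   # nodes
m <- 10902  # directed edges (after removing the single self-loop)
round(2 * m / n, 2)  # 19.24: the total-degree average (direction ignored)
round(m / n, 2)      # 9.62: the average out-degree
```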
Now check the degree distribution;
# set average node for label
avg_deg <- round(mean(degree(graph)),2)
# plot linear distribution
ggplot() +
geom_histogram(aes(x = degree_all), bins=40) +
geom_vline(aes(xintercept = avg_deg, colour = "red"), show.legend = FALSE) +
annotate("text", x = avg_deg+5, y = 160, label = paste0("Avg deg = ", avg_deg), angle = 90, color = "red") +
labs(title = "Degree distribution histogram for Uni emails",
x = "Degrees",
y = "Number of nodes") +
theme(axis.title.y = element_text(angle = 90))
We see the common power-law-like distribution: there are a few very connected nodes and many less connected nodes. The maximum corresponds to node 105 reported earlier, with its 142 links, far more than the second most connected node.
Visualization of the network:
# set colour and size only for the top 50 nodes
top_50_nodes <- order(degree_all, decreasing = TRUE)[1:50]
V(graph)$color <- "grey3"
V(graph)[top_50_nodes]$color <- "red"
V(graph)$size <- 0.75
V(graph)[top_50_nodes]$size <- 2
# plot the full network
ggraph(graph, layout = "kk") +
geom_edge_link(width = 0.1, alpha = 0.5, color = "grey") +
geom_node_point(color = V(graph)$color, size = V(graph)$size) +
theme_void() +
ggtitle("Visualisation of the network, 50 most connected nodes highlighted")+
theme(legend.position = "none")
We see that the most connected nodes are all centered in the graph. On the outside, we can see the nodes with only a couple of connections.
Steps
1) Find the theoretical epidemic threshold βc for your network for the information to reach a significant number of nodes.
The formula to calculate the theoretical epidemic threshold that we learned from class is:
\[ \beta_c = \mu \frac{\langle k \rangle}{\langle k^2 \rangle}\]
where:
μ (mu): Recovery rate — typically set to 0.1 in SIR models.
⟨k⟩ (mean degree): The average number of connections (edges) per node in the network.
⟨k²⟩ (mean squared degree): The average of the square of each node’s degree — it captures how much variability (or inequality) there is in node connectivity.
Updating our network to be undirected
First, we make our network undirected. This creates a truer social network, where spreading is symmetric between nodes and can happen in either direction.
We use the as_undirected function from igraph with mode = "collapse" to update the graph. The effect of this is:
- It merges reciprocal edges (A→B and B→A) into one link, which matches the idea of a “social tie”.
That’s why, with collapse, the number of edges drops by half. This better reflects the true number of relationships.
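The effect of mode = "collapse" can be verified on a toy graph containing one reciprocal pair (an illustration, not part of the analysis):

```r
library(igraph)
# A <-> B is a reciprocal pair; A -> C is one-way
g_demo <- graph_from_literal(A -+ B, B -+ A, A -+ C)
ecount(g_demo)                                    # 3 directed edges
ecount(as_undirected(g_demo, mode = "collapse"))  # 2 ties: A--B merged into one
```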
# Create the undirected version of the graph
graph <- as_undirected(graph, mode = "collapse")
paste0("There are ", vcount(graph), " nodes and ", ecount(graph), " edges in our undirected Spanish email network.")## [1] "There are 1133 nodes and 5451 edges in our undirected Spanish email network."
Now we start to calculate epidemic threshold using degree moments:
# Set recovery rate (mu) — use the standard 0.1 value
mu <- 0.1
# Calculate degree of each node
k <- degree(graph)
# Mean degree and mean squared degree
mean_k <- mean(k)
mean_k2 <- mean(k^2)
# Calculate threshold with our formula: βc = μ * <k> / <k^2>
beta_c <- mu * mean(degree(graph)) / (mean(degree(graph)^2))
# Output the result
paste0("Theoretical epidemic threshold βc is approximately: ", round(beta_c, 5),
" because ⟨k²⟩ = ", round(mean_k2, 4), " > ⟨k⟩ = ", round(mean_k, 4))## [1] "Theoretical epidemic threshold βc is approximately: 0.00535 because ⟨k²⟩ = 179.8164 > ⟨k⟩ = 9.6222"
The estimated threshold βc ≈ 0.00535 is very low, indicating that even a small infection rate β would allow the information to reach a significant portion of the network.
This low threshold is a direct consequence of the high variance in degree:
⟨k⟩ = 9.62 but ⟨k²⟩ = 179.82, which suggests the presence of a few nodes with very high degree (i.e., “super spreaders”).
The large gap between ⟨k⟩ and ⟨k²⟩ is a signature of heterogeneity in the network. This is typical in real-world social networks, where most people have few contacts, and a few individuals are highly connected.
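To see how this heterogeneity drives the threshold down, we can compare two synthetic degree sequences with the same mean degree (a toy illustration, not our network's data):

```r
mu <- 0.1
# Threshold formula: beta_c = mu * <k> / <k^2>
beta_c_of <- function(k) mu * mean(k) / mean(k^2)
k_hom <- rep(10, 100)                 # homogeneous: everyone has 10 contacts
k_het <- c(rep(5, 90), rep(55, 10))   # same mean of 10, but with a few hubs
beta_c_of(k_hom)  # 0.01
beta_c_of(k_het)  # ~0.0031: the hubs cut the threshold to under a third
```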
2) Assuming 1% of randomly-selected initial spreaders, simulate the SIR model below and above that threshold and plot the number of infected people as a function of β.
Here we simulate the SIR model with different β values. In the model, the values represent:
- 0 = S, 1 = I, 2 = R
The broad process is:
Set the inputs: graph, threshold, recovery rate & seeds.
Initialise all nodes as 0 in period 0, except the infected, which are the randomly assigned seeds.
Create an output table with time and infection period.
Loop while there are still infected people, adding new infections to the table in the period they became infected.
All nodes start as value 0 as they’re uninfected. We then apply infection to the randomly selected seeds and create the function for the SIR model:
# Set recovery rate μ
mu <- 0.1
# SIR simulation function
sim_sir <- function(g, beta, mu, seeds){
state <- rep(0, vcount(g))
state[seeds] <- 1 # set initial spreaders
t <- 0
table <- data.frame(t=0, inf=seeds)
while(sum(state == 1) > 0){
t <- t + 1
new_inf <- c()
for(i in which(state == 1)){ # among all infected ones
neighbors <- neighbors(g, i)
for(j in neighbors){
if(state[j] == 0 && runif(1) < beta){
# If neighbor j is susceptible (i.e., state[j] == 0), then infect it with probability β. "runif(1) < beta" means to generate a random number between 0 and 1. If it’s less than β, the infection is considered successful.
new_inf <- c(new_inf, j)
}
}
if(runif(1) < mu){ # Recover an infected node with probability μ
state[i] <- 2
}
}
new_inf <- unique(new_inf) # fix: avoid duplicates from multiple infected neighbors
state[new_inf] <- 1
if(length(new_inf) > 0){ # only record when there is new infection
table <- rbind(table, data.frame(t=t, inf=new_inf))
}
}
distinct(table) # adding distinct function to keep only unique nodes at each stage, avoiding repeated counts from infections to multiple neighbours in the same time period
}
Now we run the simulations over different values. Since our critical threshold was 0.00535, we select values below, at, and above the threshold.
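For intuition, the inner per-neighbor loop performs one Bernoulli(β) trial per susceptible neighbor. An equivalent vectorized draw (a sketch with hypothetical node ids, not the function used above) looks like:

```r
set.seed(7)
beta <- 0.05
susceptible_neighbors <- c(2, 5, 9, 14, 21)  # hypothetical node ids
# One Bernoulli(beta) trial per susceptible neighbor, drawn all at once
hits <- runif(length(susceptible_neighbors)) < beta
new_inf <- susceptible_neighbors[hits]
new_inf  # subset of neighbors that got infected this step
```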
# set the thresholds, β, to test (lower than, close to, or higher than βc)
beta_vals <- c(0, 0.002, 0.005, beta_c, 0.01, 0.02, 0.1)
# keep recovery rate (mu) at 0.1
mu <- 0.1
# randomly set 1% as seeds - this is 12 nodes (of our 1133 total)
set.seed(123)
num_seeds <- ceiling(0.01 * vcount(graph))
seeds <- sample(1:vcount(graph), num_seeds)
# run n times of simulation for each β. Calculate infection total for each.
n_runs <- 10
results <- data.frame()
for(beta in beta_vals){
final_counts <- c()
for(i in 1:n_runs){
sim <- sim_sir(graph, beta, mu, seeds)
final_counts <- c(final_counts, length(unique(sim$inf)))
}
avg_inf <- mean(final_counts)
sd_inf <- sd(final_counts)
results <- rbind(results, data.frame(beta=beta, avg_infected=avg_inf, sd=round(sd_inf,2)))
}
# and view the results:
results## beta avg_infected sd
## 1 0.000000000 12.0 0.00
## 2 0.002000000 15.0 2.21
## 3 0.005000000 31.8 16.63
## 4 0.005351148 18.2 6.18
## 5 0.010000000 298.9 150.22
## 6 0.020000000 635.3 38.37
## 7 0.100000000 1008.4 13.38
Now we can plot the results to compare more easily:
# viz:β vs averaged number of infected people
ggplot(results, aes(x=beta, y=avg_infected)) +
geom_line(color="steelblue") +
geom_point(size=3) +
geom_errorbar(aes(ymin=avg_infected - sd,
ymax=avg_infected + sd),
width=0.001, alpha=0.6) +
labs(title = "Infection Size vs β (Random 1% Seeds)",
x = expression(beta),
y = "Average number of infected nodes") +
theme_minimal()
The plot clearly demonstrates the presence of an epidemic threshold in the network. When the infection rate β is below approximately 0.00535, the average number of infected nodes remains low, indicating limited or failed spread. However, just beyond this value, there is a sudden and sharp increase in the number of infections, revealing a nonlinear transition typical of epidemic processes in complex networks.
Furthermore, the plot shows a monotonic increase in the average infection size as β increases, aligning with theoretical expectations from the SIR model — higher infection rates naturally result in more widespread diffusion. Notably, the uncertainty (standard deviation) of infection size is highest around β = 0.01, suggesting that outcomes near the threshold are highly sensitive to stochastic factors such as seed selection and local network topology.
Overall, the plot illustrates a classic epidemic threshold phenomenon. Once β exceeds the critical value, even a small increase in the infection probability can lead to a disproportionately large outbreak, confirming the nonlinear nature of spreading processes in heterogeneous networks.
Now we plot the general trend to try identify smoothness (note that this plot only has 1 run per point so could be considered less robust than the previous).
# set the seeds for consistency in the starting infection nodes:
set.seed(123)
num_seeds <- ceiling(0.01 * vcount(graph))
seeds <- sample(1:vcount(graph), num_seeds)
# plot the outputs over time
results <- map_dfr(seq(0,0.1,0.005), # beta range
\(beta){
realization <- sim_sir(graph,beta,mu,seeds) # make a realization for each beta
data.frame(beta,
ninf=nrow(realization)) # create dataframe column for beta and for number infected
})
results %>% ggplot(aes(x=beta,y=ninf))+
geom_point()+
geom_line(color="red", alpha=0.5)+
geom_vline(xintercept = beta_c,linetype=2)+
labs(title = "Comparing total infections to beta values")
We see the trend holds: there is a very sharp increase around the critical threshold. Note that this value is very low and the increments between each point in the plot are also small, each representing a 0.5% increase in beta (\(\beta = 0.005\)).
With such a low critical threshold, we are very likely to end up with an epidemic in this model. Any growth is likely to lead to an epidemic. This is seen because our critical beta value is below 1% (around 0.535%). This means that in the following questions, when we select any 1% sample, whether it is random or targeted, it will likely lead to an epidemic.
To illustrate this, we see the basic reproduction number with 1 and 2 percent infection. The basic reproduction number (R₀) when beta is 0.02:
mu <- 0.1 # Recovery rate
# Calculate R₀ options a
R0_1 <- (0.01 / mu) * (mean_k2 / mean_k)
R0_2 <- (0.02 / mu) * (mean_k2 / mean_k)
# print result
paste0("Basic reproduction number R₀ ≈ ", round(R0_1, 3), " with a 1% infection rate. Or, ", round(R0_2, 3), " with a 2% infection rate.")## [1] "Basic reproduction number R₀ ≈ 1.869 with a 1% infection rate. Or, 3.738 with a 2% infection rate."
R₀ > 1 means the epidemic is likely to spread. We obtained an R₀ of 3.738, meaning spread is very probable, even when the infection probability per contact per time step is only 2%.
Seems like these rumors will be spreading in the university!
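Note that plugging β = βc into the same formula gives R₀ = 1 by construction, the boundary between die-out and epidemic (quick check using the degree moments reported above):

```r
mu <- 0.1
mean_k <- 9.6222; mean_k2 <- 179.8164  # degree moments reported above
# R0 = (beta / mu) * (<k^2> / <k>)
R0 <- function(beta) (beta / mu) * (mean_k2 / mean_k)
beta_c <- mu * mean_k / mean_k2
R0(beta_c)  # 1: the tipping point between die-out and epidemic
```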
3) Choose a β well above βc. Using centrality, communities, or any other suitable metric, find a better set of 1% of seeds in the network so we get more infected people than in the random case. Measure the difference of your choice with the random case as:
a) The difference in the total number of infected people
First, we set our very high β. Since our actual threshold βc is VERY low (<0.01), we chose a beta value that may seem modest in absolute terms but is very high relative to the threshold.
We select β = 0.15
We compare the random seeds against degree centrality and PageRank, pitting a simpler measure against a richer one. The two should be correlated, but we include both to see whether the richer PageRank calculation identifies nodes that lead to more rapid spreading.
# set the comparatively high beta and keep our same recovery rate.
beta_high <- 0.15
mu <- 0.1
num_seeds <- ceiling(0.01 * vcount(graph))
## Collect each type of sample to compare the outcomes
# 1. Random selection (baseline)
set.seed(123)
random_seeds <- sample(1:vcount(graph), num_seeds)
# 2. Degree centrality
degree_cent <- degree(graph)
degree_seeds <- order(degree_cent, decreasing = TRUE)[1:num_seeds]
# 3. PageRank
pagerank_cent <- page_rank(graph)$vector
pagerank_seeds <- order(pagerank_cent, decreasing = TRUE)[1:num_seeds]
# check the centrality measures aren't identical node sets, for redundancy
identical(degree_seeds, pagerank_seeds)## [1] FALSE
From this, we have our 1% of central nodes (which is equal to 12 nodes in our data) and we now compare the random selection vs targeted selection to see how rapidly the epidemic can spread.
## run simulations based on each of the selected and random nodes
random_sim <- sim_sir(graph, beta_high, mu, random_seeds)
degree_sim <- sim_sir(graph, beta_high, mu, degree_seeds)
pagerank_sim <- sim_sir(graph, beta_high, mu, pagerank_seeds)
# we use unique() to ensure we have properly accounted for any duplicates (although this appears to function properly)
rand_num_inf <- length(unique(random_sim$inf))
deg_num_inf <- length(unique(degree_sim$inf))
pager_num_inf <- length(unique(pagerank_sim$inf))
Print all of our results for total infections. They are relatively similar at this high value, which is not surprising.
Of the 1,133 people in the university network, we see that nearly all are infected in each model. This is because the high infection rate, well above our critical threshold, leads to long-run epidemics.
cat(
"Of the ", vcount(graph), " nodes. We see the following under each setting:\n",
"Random selection: ", rand_num_inf, " (equal to ", round(rand_num_inf/vcount(graph)*100,2), "%)\n",
"Degree centrality: ", deg_num_inf, " (equal to ", round(deg_num_inf/vcount(graph)*100,2), "%)\n",
"PageRank: ", pager_num_inf, " (equal to ", round(pager_num_inf/vcount(graph)*100,2), "%)")## Of the 1133 nodes. We see the following under each setting:
## Random selection: 1056 (equal to 93.2 %)
## Degree centrality: 1052 (equal to 92.85 %)
## PageRank: 1066 (equal to 94.09 %)
We see that at such high beta values, almost everyone is infected regardless of the selection method. As we are above the critical threshold, it becomes an epidemic regardless of who starts it.
b) The difference in the time of the peak of infection (when most infections happen).
First, aggregate the data into one table from the 3 outputs
## line plot with both matched, or 2 plots if we show each of susceptible, infected and recovered
# add labels to merge the DF
random_sim$type <- "Random"
degree_sim$type <- "Degree centrality"
pagerank_sim$type <- "PageRank"
results <- rbind(random_sim, degree_sim, pagerank_sim)
table(results$type)##
## Degree centrality PageRank Random
## 1052 1066 1056
We now have a long table combining our 3 simulation runs.
## aggregate the number of infections by each selection method first then plot
results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
ggplot(aes(x = t, y = sum_inf, group = type, colour = type)) +
geom_line() +
labs(title = "Comparison of infection trends with β=15%, targeted vs random selection",
y = "Number of infections",
x = "Time period",
colour = "Selection method") +
theme_minimal() + # Or your preferred theme
theme(axis.title.y = element_text(angle = 90))
Summary: targeted infection, i.e. having those who are the most influential in the email system be the first ones to spread the rumor, can make the rumor spread faster than if random people in the network try to start the rumor.
PageRank produces a slightly faster and slightly higher infection peak than the degree centrality measure, as hypothesised. However, these two are highly correlated.
The peaks are nevertheless comparable across methods, because we have such a high beta value. Each selection method peaks at around 225 infections on its biggest period of spreading. The only real difference is a delay in when the peak occurs.
# calculate the peak period in each
results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
group_by(type) |> filter(sum_inf == max(sum_inf))## # A tibble: 3 × 3
## # Groups: type [3]
## type t sum_inf
## <chr> <dbl> <int>
## 1 Degree centrality 3 218
## 2 PageRank 3 231
## 3 Random 5 223
We see that the peak for PageRank is period 3 (231 infections), Degree centrality is period 3 as well, (218 infections) and the random allocation was period 5 (223 infections). So there was a 2 period acceleration in the peak by using targeted nodes.
To illustrate whether the peaks differ at other levels, we quickly create a model with the beta infection rate set to 3x the critical beta value (just over 1.5%, and almost 10 times less infectious than in the plot above):
## comparing with a new, more moderate (but still high relative to the critical threshold) beta value
beta_high2 <- beta_c*3
beta_high2## [1] 0.01605344
random_sim <- sim_sir(graph, beta_high2, mu, random_seeds)
degree_sim <- sim_sir(graph, beta_high2, mu, degree_seeds)
pagerank_sim <- sim_sir(graph, beta_high2, mu, pagerank_seeds)
# add labels, join and check total values.
random_sim$type <- "Random"
degree_sim$type <- "Degree centrality"
pagerank_sim$type <- "PageRank"
results <- rbind(random_sim, degree_sim, pagerank_sim)
table(results$type) # we see the total infections are much lower. ##
## Degree centrality PageRank Random
## 519 540 504
## now plot again to compare the trends.
results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
ggplot(aes(x = t, y = sum_inf, group = type, colour = type)) +
geom_line() +
labs(title = paste0("Comparison of infection trends with β=",round(beta_high2*100,2),"%, targeted vs random selection"),
y = "Number of infections",
x = "Time period",
colour = "Selection method") +
theme_minimal() + # Or your preferred theme
theme(axis.title.y = element_text(angle = 90))
Here we see a similar trend but on a much smaller scale and with more visible noise. The degree centrality run dies out by about period 40; at this stage, the random assignment is only just reaching its peak values. Likewise, infections for the random assignment stay low at the start, while the targeted selections peak immediately due to the importance of the initially infected nodes.
4) Using the same β, design a “quarantine strategy”: at time step t=3 or 4, quarantine 20% of the susceptible population. You can model quarantine by temporarily removing these nodes. Release the quarantined nodes 8 time steps later, making them susceptible again. Measure the difference with respect to no quarantine.
We will first create a new function with the quarantine strategy (based on our previous regular SIR model), where we quarantine 20% of the susceptible population at t=4, then release them at t=12.
sim_sir_quarantine <- function(g, beta, mu, seeds, quarantine_t = 4, release_t = 12){
state <- rep(0, vcount(g)) # 0 = S, 1 = I, 2 = R, -1 = Quarantined
state[seeds] <- 1 # set initial spreaders
t <- 0
table <- data.frame(t=0, inf=seeds)
quarantine_group <- c()
while(sum(state == 1) > 0){ # while there is still infected people
t <- t + 1
new_inf <- c()
# quarantine 20% of susceptible nodes at time t = quarantine_t
if(t == quarantine_t){
susceptible_nodes <- which(state == 0)
set.seed(t) # reproducibility
quarantine_group <- sample(susceptible_nodes, ceiling(0.2 * length(susceptible_nodes)))
state[quarantine_group] <- -1 # set them as "quarantined"
}
# release them back at time t = release_t
if(t == release_t && length(quarantine_group) > 0){
state[quarantine_group] <- 0 # become susceptible again
}
for(i in which(state == 1)){ # loop through infected nodes
neighbors <- neighbors(g, i)
for(j in neighbors){
if(state[j] == 0 && runif(1) < beta){
new_inf <- c(new_inf, j)
}
}
if(runif(1) < mu){
state[i] <- 2 # recover
}
}
new_inf <- unique(new_inf) # fix: avoid duplicate infections
state[new_inf] <- 1
if(length(new_inf) > 0){
table <- rbind(table, data.frame(t=t, inf=new_inf))
}
}
table
}
Then, we will simulate 20 times with the regular SIR model and the SIR model with quarantine policies to observe the differences in the long run and short run (during the quarantine period).
# set the very high beta level to same as q3
beta_high = 0.15
# Run simulation 20 times with quarantine
results_q <- replicate(20, {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Run simulation 20 times without quarantine
results_no_q <- replicate(20, {
sim <- sim_sir(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Compare average results
mean_q <- mean(results_q)
mean_no_q <- mean(results_no_q)
diff <- mean_no_q - mean_q
paste0("With quarantine, avg infected = ", round(mean_q),
"; without quarantine = ", round(mean_no_q),
" → Difference = ", round(diff), " nodes.")## [1] "With quarantine, avg infected = 1029; without quarantine = 1051 → Difference = 22 nodes."
Long-run outcomes with high beta value.
Surprisingly, after running 20 simulations for each strategy, the quarantine scenario resulted in almost equivalent total infections (1029 vs 1051) compared with the model without a quarantine policy, a difference of only 22 fewer infections.
This suggests that the quarantine intervention is not effective at reducing spread in the long run: the beta value is so high that an epidemic occurs whenever the virus has time to spread freely (i.e. the quarantine is only temporary). The following sections compare and provide more detail on the random allocation.
We will plot a box plot to compare the long run total infections for each model.
# tidy the data for later visualization
df_results <- data.frame(
infected = c(results_no_q, results_q),
strategy = rep(c("No Quarantine", "With Quarantine"), each = 20) # 20 runs per strategy
)
# Visualize the difference of results with boxplot
ggplot(df_results, aes(x = strategy, y = infected, fill = strategy)) +
geom_boxplot(alpha = 0.5, width = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Comparison of Final Infection Counts, β = 0.15",
x = "Strategy",
y = "Total number of infected nodes") +
theme_minimal() +
theme(legend.position = "none")
The results reveal that the median infection counts are very similar between the two strategies, indicating no strong difference in average outcomes. However, the no-quarantine group exhibits slightly more variability, with a few more extreme values on both ends.
Interestingly, the quarantine strategy does not significantly reduce infections; it only trims the average count slightly in this particular setting. This may be due to reintroducing susceptible nodes back into the network at a time when the infection is still active, or due to randomly quarantining non-critical nodes.
Overall, this visualization suggests that random temporary quarantine of 20% of susceptible individuals from t = 4 to t = 12 is not an effective intervention at high infection rates (β = 0.15). Future strategies may benefit from more targeted approaches, such as quarantining high-centrality nodes.
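One such targeted variant could replace the random draw inside the quarantine step with the highest-degree susceptible nodes. A sketch on toy vectors (`deg` and `state` are hypothetical stand-ins for the real degree vector and node states):

```r
# Toy vectors standing in for the real degree vector and node states
deg   <- c(40, 3, 5, 28, 2, 7, 1, 9, 16, 4)  # hypothetical degrees
state <- c( 0, 0, 1,  0, 0, 0, 2, 0,  0, 0)  # 0 = S, 1 = I, 2 = R
susceptible_nodes <- which(state == 0)
n_q <- ceiling(0.2 * length(susceptible_nodes))  # quarantine 20% of S
# pick the highest-degree susceptible nodes instead of a random sample
quarantine_group <- susceptible_nodes[
  order(deg[susceptible_nodes], decreasing = TRUE)][seq_len(n_q)]
state[quarantine_group] <- -1
which(state == -1)  # the two biggest susceptible hubs
```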
Short-term quarantine outcomes with high beta value
Apart from the final result, does the quarantine policy work throughout the quarantine? If it can slow the spread, we can at least show the quarantine works to some extent.
# We run the simulation 20 times again, as previously we only obtained the final infected number, but we didn't register the infected number at each stage
set.seed(123)
runs_no_q <- map(1:20, function(i) {
sim <- sim_sir(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
runs_q <- map(1:20, function(i) {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
# Join simulation data from both policies
all_runs_no_q <- bind_rows(runs_no_q) %>% mutate(strategy = "No Quarantine")
all_runs_q <- bind_rows(runs_q) %>% mutate(strategy = "With Quarantine")
all_runs <- bind_rows(all_runs_no_q, all_runs_q)
Now we calculate the cumulative infected number and its confidence interval in each time period for both policies, and visualize the results.
cumulative_all <- all_runs %>%
group_by(strategy, run, t) %>%
summarise(new_inf = n(), .groups = "drop") %>%
group_by(strategy, run) %>%
arrange(t) %>%
mutate(cum_inf = cumsum(new_inf)) %>%
group_by(strategy, t) %>%
summarise(
mean_inf = mean(cum_inf),
sd_inf = sd(cum_inf),
ci_upper = mean_inf + 1.96 * sd_inf / sqrt(20), # 20 runs per strategy
ci_lower = mean_inf - 1.96 * sd_inf / sqrt(20),
.groups = "drop"
)
ggplot(cumulative_all, aes(x = t, y = mean_inf, color = strategy, fill = strategy)) +
geom_line(linewidth = 0.4) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.3, color = NA) +
geom_vline(xintercept = 4, linetype = "dashed", color = "darkred") +
geom_vline(xintercept = 12, linetype = "dashed", color = "darkblue") +
annotate("text", x = 4, y = 0, label = "Start quarantine (t=4)", angle = 90, vjust = -0.5,hjust=-1, color = "darkred", size = 3) +
annotate("text", x = 12, y = 0, label = "End quarantine (t=12)", angle = 90, vjust = -0.5, hjust=-1, color = "darkblue", size = 3) +
labs(title = "Infection Spread with quarantine, β = 0.15",
subtitle = "With vs Without Quarantine (20 simulations each)",
x = "Time step (t)",
y = "Cumulative number of infected nodes") +
theme_minimal() +
theme(legend.title = element_blank())
This plot illustrates the average cumulative number of infected nodes over time across 20 simulations, comparing the No Quarantine and With Quarantine strategies, where the vertical dashed lines indicate the start (t = 4) and end (t = 12) of the quarantine period.
We observe that the quarantine period does reduce infections, with on average about 139 fewer infections by period 12 (the end of the quarantine period). However, growth in infections is still rapid with quarantine, so it doesn’t flatten the rate of spreading very effectively.
# check differences in counts at end of quarantine
no_q_t12 <- cumulative_all |>
filter(t == 12) |> select(strategy, mean_inf) |> filter(strategy == "No Quarantine") |> pull(mean_inf)
yes_q_t12 <- cumulative_all |>
filter(t == 12) |> select(strategy, mean_inf) |> filter(strategy == "With Quarantine") |> pull(mean_inf)
paste0("The average infections without quarantine at time 12 was: ", no_q_t12, " while with quarantine, it reduced to ", yes_q_t12, " in time 12. Indicating a reduction of ", no_q_t12-yes_q_t12, " at the end of the quarantine period.")
## [1] "The average infections without quarantine at time 12 was: 1007.25 while with quarantine, it reduced to 867.9 in time 12. Indicating a reduction of 139.35 at the end of the quarantine period."
As discussed above, in the long run, an epidemic emerges and there are no differences.
This suggests that the random quarantine of 20% susceptible individuals has limited impact on the epidemic’s total spread under a high transmission rate (β = 0.15).
In conclusion, while the quarantine flattens the curve slightly, a more targeted strategy may be needed to produce meaningful reductions in total infections.
Testing with lower beta (still above threshold).
So far we have used a relatively high beta of 15%. We also wonder whether the quarantine policy would work better with a lower beta that still sits above our critical threshold. Therefore, we will try a new beta of 1%, which is still higher than the critical threshold but much lower than our previous 15%.
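As a quick sanity check, we can confirm that the new β still exceeds the epidemic threshold. This is a sketch with ⟨k⟩ and ⟨k²⟩ hard-coded to illustrative values roughly matching our earlier calculation (βc ≈ 0.00535); in the full analysis they come from degree(graph):

```r
# Sketch: is beta = 0.01 still above beta_c = mu * <k> / <k^2>?
# <k> and <k^2> below are assumed illustrative values, not recomputed here.
mu      <- 0.1
mean_k  <- 9.6    # assumed <k>
mean_k2 <- 180    # assumed <k^2>
beta_c_check <- mu * mean_k / mean_k2
beta_c_check          # ~0.0053
0.01 > beta_c_check   # TRUE: an epidemic is still possible at beta = 0.01
```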
Here we reproduce our previous work but with beta of 1%.
# set a lower beta (note: the name beta_high is kept from the earlier code, although it now holds the lower value)
beta_high = 0.01
# Run simulation 20 times with quarantine
results_q <- replicate(20, {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Run simulation 20 times without quarantine
results_no_q <- replicate(20, {
sim <- sim_sir(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Compare average results
mean_q <- mean(results_q)
mean_no_q <- mean(results_no_q)
diff <- mean_no_q - mean_q
paste0("With quarantine, avg infected = ", round(mean_q),
"; without quarantine = ", round(mean_no_q),
" → Difference = ", round(diff), " nodes.")
## [1] "With quarantine, avg infected = 265; without quarantine = 300 → Difference = 35 nodes."
At this lower β, total infections are higher without quarantine in the long run; that is, quarantine lowers the overall infection count on average. We test this more robustly:
# tidy the data for later visualization
df_results <- data.frame(
infected = c(results_no_q, results_q),
strategy = rep(c("No Quarantine", "With Quarantine"), each = 20) # 20 runs per strategy
)
# Visualize the difference of results with boxplot
ggplot(df_results, aes(x = strategy, y = infected, fill = strategy)) +
geom_boxplot(alpha = 0.5, width = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Comparison of Final Infection Counts (β = 0.01)",
x = "Strategy",
y = "Total number of infected nodes") +
theme_minimal() +
theme(legend.position = "none")
We see that in the long run, quarantine yields a slightly lower total infection count, but the difference is not clearly significant. At this lower β we also see some runs with very little spread, likely because the randomly chosen initial spreaders were weakly connected nodes.
# We run the simulation 20 times again, as previously we only obtained the final infected number, but we didn't register the infected number at each stage
set.seed(123)
runs_no_q <- map(1:20, function(i) {
sim <- sim_sir(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
runs_q <- map(1:20, function(i) {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
# Join simulation data from both policies
all_runs_no_q <- bind_rows(runs_no_q) %>% mutate(strategy = "No Quarantine")
all_runs_q <- bind_rows(runs_q) %>% mutate(strategy = "With Quarantine")
all_runs <- bind_rows(all_runs_no_q, all_runs_q)
We then summarise and plot again to see the trend over time.
cumulative_all <- all_runs %>%
group_by(strategy, run, t) %>%
summarise(new_inf = n(), .groups = "drop") %>%
group_by(strategy, run) %>%
arrange(t) %>%
mutate(cum_inf = cumsum(new_inf)) %>%
group_by(strategy, t) %>%
summarise(
mean_inf = mean(cum_inf),
sd_inf = sd(cum_inf),
ci_upper = mean_inf + 1.96 * sd_inf / sqrt(20),
ci_lower = mean_inf - 1.96 * sd_inf / sqrt(20),
.groups = "drop"
)
ggplot(cumulative_all, aes(x = t, y = mean_inf, color = strategy, fill = strategy)) +
geom_line(linewidth = 0.4) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.3, color = NA) +
geom_vline(xintercept = 4, linetype = "dashed", color = "darkred") +
geom_vline(xintercept = 12, linetype = "dashed", color = "darkblue") +
annotate("text", x = 4, y = 0, label = "Start quarantine (t=4)", angle = 90, vjust = -0.5,hjust=-1, color = "darkred", size = 3) +
annotate("text", x = 12, y = 0, label = "End quarantine (t=12)", angle = 90, vjust = -0.5, hjust=-1, color = "darkblue", size = 3) +
labs(title = "Infection Spread Over Time (β = 0.01)",
subtitle = "With vs Without Quarantine (20 simulations each)",
x = "Time step (t)",
y = "Cumulative number of infected nodes") +
theme_minimal() +
theme(legend.title = element_blank())+
lims(x=c(0,100))
While our earlier simulations with a high infection rate (β = 0.15) showed minimal differences between the quarantine and no-quarantine scenarios, the results under a lower infection rate (β = 0.01) paint a very different picture.
Surprisingly, in this second batch of simulations we observed that the average number of infections was actually higher with quarantine (312 infected nodes on average) than without it (266), a difference of 46 nodes in the wrong direction. This is the opposite of the final-count comparison above, and it contrasts with the previous high-β case, where the two scenarios were comparable in the long run.
This reversal suggests that, under low transmission rates, the current quarantine strategy does not effectively reduce the spread.
There are several possible reasons:
Randomly quarantined nodes might not be important for transmission, so the main spread continues;
Letting quarantined people back in too early (at t = 12) may restart the infection;
When β is low, small changes can have big effects due to randomness.
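The third point can be illustrated with a toy branching process (not our network; the Poisson offspring distribution and R0 = 1.2 are assumptions for illustration): near the threshold, identical parameters produce wildly different outbreak sizes from run to run.

```r
# Toy branching process: each case infects a Poisson(R0) number of new cases.
set.seed(42)
branching_run <- function(R0, max_gen = 50) {
  infected <- 1   # one initial case
  total    <- 1
  for (g in seq_len(max_gen)) {
    infected <- sum(rpois(infected, R0))  # new cases this generation
    total    <- total + infected
    if (infected == 0) break              # outbreak died out
  }
  total
}
sizes <- replicate(200, branching_run(R0 = 1.2))
range(sizes)  # some runs die out almost immediately, others explode
```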
Taken together, these results indicate that random quarantine is not always a good solution. In fact, at lower transmission levels, it could even make things worse.
A better approach would be to design smarter quarantine strategies, for example targeting central nodes or adapting the quarantine timing to the situation, to control the spread more consistently across scenarios.
5) Suppose now that you can convince 5% of people in the network not to spread that information at all.
If we have 5% who will not spread information at all, they are effectively immune. This means we can treat these 5% of people as if they had been vaccinated. Or, in our setting, as if they had realised that spreading rumors over the university email is bad and now refuse to read or share any gossip. In this section we compare whether targeted or random vaccination works more effectively.
As these people refuse to spread rumours, we call them kind_people.
a) Choose those 5% randomly in the network. Simulate the SIR model above βc using 1% of the remaining nodes as seeds. Choose those seeds randomly.
First we select our random 5% of nodes to make immune. Since these nodes will never transmit, they are effectively deleted, so we remove them from the graph.
Per the question, we run the model at the critical infection rate \(\beta_c\) (≈ 0.00535), which is the value passed to the simulation below.
set.seed(1234)
# random vaccination campaign with 5% randomly selected - essentially delete them
kind_people_random <- sample(1:vcount(graph),
vcount(graph)*0.05)
graph_rand <- delete_vertices(graph, kind_people_random)
# print our findings for info
paste0("This random selection removed ", vcount(graph)-vcount(graph_rand), " nodes, and ", ecount(graph)-ecount(graph_rand), " edges from the graph")
## [1] "This random selection removed 56 nodes, and 488 edges from the graph"
# now we set our 1% of seeds from the remaining nodes
inf_seeds <- sample(1:vcount(graph_rand),
vcount(graph_rand)*0.01)
# now create our random model
realization_random <- sim_sir(graph_rand,
beta = beta_c,
mu,
inf_seeds) %>%
group_by(t) %>% summarize(ninf=n())
realization_random |>
ggplot(aes(x = t, y = ninf)) +
geom_line() +
labs(title = paste0("Infection trend with randomly assigned nice people (non gossipers)"),
y = "Number of infections",
x = "Time period") +
theme_minimal() +
theme(axis.title.y = element_text(angle = 90))
b) Choose those 5% according to their centrality. Simulate the SIR model above βc using 1% of the remaining nodes as seeds. Choose those seeds randomly.
We will use PageRank to select the 5% as this had the most extreme peak in Question 3.
We see that this targeted removal of nodes also takes out many more potential connections, as over 1,800 edges are removed from the graph. This is over 3x as many connections removed as under the random selection of the 5% of nice people (vaccinated).
We also see that removing the 5% most influential nodes is much more effective at killing any growth in infections. In fact, the infection dies out by period 15 and the spread of information stops.
set.seed(12345)
# targeted selection of the most influential 5% - essentially delete them, using our existing pagerank_cent calculations
num_seeds <- vcount(graph)*0.05
kind_people_pager <- order(pagerank_cent, decreasing = TRUE)[1:num_seeds]
graph_pager <- delete_vertices(graph, kind_people_pager)
# print our findings for info
paste0("This pagerank selection removed ", vcount(graph)-vcount(graph_pager), " nodes, and ", ecount(graph)-ecount(graph_pager), " edges from the graph")
## [1] "This pagerank selection removed 56 nodes, and 1867 edges from the graph"
# now we set our 1% of seeds from the remaining nodes
inf_seeds <- sample(1:vcount(graph_pager),
vcount(graph_pager)*0.01)
# now create our random model
realization_pager <- sim_sir(graph_pager,
beta = beta_c,
mu,
inf_seeds) %>%
group_by(t) %>% summarize(ninf=n())
#plot
realization_pager |>
ggplot(aes(x = t, y = ninf)) +
geom_line() +
labs(title = paste0("Infection trend with targeted nice people (non gossipers)"),
y = "Number of infections",
x = "Time period") +
theme_minimal() +
theme(axis.title.y = element_text(angle = 90))
c) Measure the difference between both cases as you did in step 3.
Using the code from the slides, we plot the 2 trends to compare the outputs over time initially.
Let’s visualize the evolution together with the realization with no vaccination we got earlier.
We see that random vaccination is better than no vaccination, but targeted vaccination is much more effective and essentially flattens the curve: growth never becomes exponential.
realization<- results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
filter(type=="Random")
ggplot() +
geom_line(data=realization,aes(x=t,y=sum_inf,col="No Vaccination")) +
geom_line(data=realization_random,aes(x=t,y=ninf,col="Vacc. Random"))+
geom_line(data=realization_pager,aes(x=t,y=ninf,col="Vacc. Targeted"))
However, further simulations are needed to assess whether this effect holds consistently across multiple realizations or was the result of stochastic variance. Having considered this, we will run the simulation 30 times instead of comparing just one pair of random realizations, in order to assess statistical significance. (Here, we slightly increase the number of simulations, as we observed some unexpected variability in the results.)
### Step 5c: Compare both strategies through multiple simulations
set.seed(1234)
# Parameters
beta_5c <- beta_c
mu_5c <- 0.1
runs_5c <- 30
kind_n <- ceiling(0.05 * vcount(graph))
# Create a new blank data frame
results_5c <- data.frame(
strategy = character(),
total_infected = numeric(),
peak_time = numeric()
)
# ----- 5a: Random kind_people (30x) -----
for (i in 1:30) {
kind_people_random <- sample(1:vcount(graph), kind_n)
graph_rand <- delete_vertices(graph, kind_people_random)
seed_n <- ceiling(0.01 * vcount(graph_rand))
inf_seeds <- sample(1:vcount(graph_rand), seed_n)
sim <- sim_sir(graph_rand, beta = beta_5c, mu = mu_5c, seeds = inf_seeds)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min() # in case of multiple peak times
results_5c <- rbind(results_5c, data.frame(
strategy = "Random",
total_infected = total_inf,
peak_time = peak
))
}
# ----- 5b: PageRank kind_people (30x) -----
for (i in 1:30) {
graph_pager <- delete_vertices(graph, kind_people_pager)
seed_n <- ceiling(0.01 * vcount(graph_pager))
inf_seeds <- sample(1:vcount(graph_pager), seed_n)
sim <- sim_sir(graph_pager, beta = beta_5c, mu = mu_5c, seeds = inf_seeds)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min()
results_5c <- rbind(results_5c, data.frame(
strategy = "PageRank",
total_infected = total_inf,
peak_time = peak
))
}
## ----- Combine and compare -----
summary_5c <- results_5c %>%
group_by(strategy) %>%
summarise(
avg_infected = mean(total_infected),
sd_infected = sd(total_infected),
avg_peak = mean(peak_time),
sd_peak = sd(peak_time)
)
print(summary_5c)
## # A tibble: 2 × 5
## strategy avg_infected sd_infected avg_peak sd_peak
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 PageRank 17.6 5.21 0 0
## 2 Random 28.2 13.2 0 0
## ----- Visualisation -----
ggplot(results_5c, aes(x = strategy, y = total_infected, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Total Infected (30 runs): Random vs PageRank Kind People",
y = "Total number of infected nodes") +
theme_minimal()
ggplot(results_5c, aes(x = strategy, y = peak_time, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Peak Time (30 runs): Random vs PageRank Kind People",
y = "Time step of peak") +
theme_minimal()
Based on the results of 30 simulations, the PageRank-targeted vaccination strategy clearly outperforms the random vaccination strategy in reducing the spread of infection:
On average, the PageRank strategy resulted in 17.6 infected nodes, compared to 28.2 infected nodes under the random strategy.
The standard deviation is also lower under PageRank (5.2 vs 13.2), suggesting more stable and consistent results.
The peak time was zero in both cases, meaning new infections peaked right at the start and then subsided quickly; even so, PageRank consistently kept the total lower.
These findings confirm that removing highly central nodes (according to PageRank) is an effective containment method. These nodes likely act as major bridges or hubs in the network, so vaccinating them breaks key transmission paths. In contrast, the random strategy may occasionally remove critical nodes, but also wastes many “vaccinations” on peripheral nodes, leading to less predictable and less effective results.
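The hub intuition can be sketched on a synthetic heavy-tailed degree sequence (assumed parameters, not the actual email network, and ignoring how removal rewires the residual graph): removing the top 5% of nodes by degree cuts ⟨k²⟩/⟨k⟩, the quantity that drives R0, far more than removing a random 5%.

```r
# Synthetic heavy-tailed degree sequence (assumed parameters)
set.seed(1)
deg <- round(rlnorm(1000, meanlog = 1.5, sdlog = 1))
ratio <- function(d) mean(d^2) / mean(d)       # <k^2>/<k>
top5  <- order(deg, decreasing = TRUE)[1:50]   # top 5% by degree
rand5 <- sample(seq_along(deg), 50)            # random 5%
c(full           = ratio(deg),
  random_removed = ratio(deg[-rand5]),
  hubs_removed   = ratio(deg[-top5]))          # hubs_removed is by far the smallest
```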
In short, when it comes to minimizing spread, targeted immunization based on network centrality is a better strategy than random selection.
7) With the results of step 2, train a model that predicts the time to infection of a node using their degree, centrality, betweenness, PageRank and any other predictors you see fit. Use that model to select the seed nodes as those with the smallest time to infection in step 3. Repeat step 5 with this knowledge.
7.1. (Based on step 2) Predictive model
Firstly, we extract the results we obtained in the step 2 as our data for training.
infection_times <- results |>
filter(type=="Random") |> #we only use random infection data, as other infection methods were not used in the step 2 but other steps
group_by(inf) |>
summarise(first_infected = min(t)) |>
rename(node = inf)
Then, we calculate the network metrics (degree, centrality, etc.) as inputs for modelling:
metrics_df <- data.frame(
node = as.numeric(V(graph)),
degree = degree(graph),
betweenness = betweenness(graph, normalized = TRUE),
closeness = closeness(graph, normalized = TRUE),
pagerank = page_rank(graph)$vector,
eigen_centrality = eigen_centrality(graph)$vector,
clustering = transitivity(graph, type="local", isolates="zero")
)
We join both data frames to obtain our training and test data:
data_7 <- inner_join(infection_times, metrics_df, by = "node")
set.seed(123)
train_index <- sample(1:nrow(data_7), 0.7 * nrow(data_7))
train_data <- data_7[train_index, ]
test_data <- data_7[-train_index, ]
Before creating the predictive models, we examine the distributions of these variables and the correlations among them:
We observed that many network metrics are strongly correlated with each other (such as degree and PageRank), which could hurt model performance. We will therefore first run a lasso model for feature selection, and take it as a baseline model.
On the other hand, it is important to note that the target variable infection_times takes only positive integer values, and its relation to some of the network metrics does not look perfectly linear. Therefore, after creating the baseline linear model with lasso, we will use the selected features to fit a Poisson GLM.
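The Poisson part of this plan can be sketched on synthetic data (the data-generating process below is an assumption for illustration, not our network): a positive count response whose log-mean falls with a centrality-style covariate is a natural fit for glm(family = poisson).

```r
# Sketch: Poisson GLM on synthetic data where more "central" nodes
# get infected earlier (assumed data-generating process).
set.seed(7)
n <- 300
centrality <- runif(n)
t_inf <- 1 + rpois(n, lambda = exp(2 - 1.5 * centrality))  # positive counts
fit_pois <- glm(t_inf ~ centrality, family = poisson)
coef(fit_pois)["centrality"]  # negative: higher centrality, earlier infection
```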
We first do a lasso model for feature selection:
# create X and y
X_train <- model.matrix(first_infected ~ degree + betweenness + closeness + pagerank +
eigen_centrality + clustering, data = train_data)[, -1]
y_train <- train_data$first_infected
# Lasso model with cross-validation
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1)
# check the best lambda
lasso_model$lambda.min
## [1] 0.1670284
# inspect the coefficients selected at the best lambda
coef(lasso_model, s = "lambda.min")
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 54.326624
## degree .
## betweenness .
## closeness -46.962021
## pagerank -1751.859082
## eigen_centrality -9.444537
## clustering 6.740876
We see that the selected features are closeness, pagerank, eigen_centrality and clustering. Using these features, we now build a Poisson GLM:
poisson_model <- glm(first_infected ~ closeness + pagerank + eigen_centrality + clustering,
data = train_data,
family = "poisson")
summary(poisson_model)
##
## Call:
## glm(formula = first_infected ~ closeness + pagerank + eigen_centrality +
## clustering, family = "poisson", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.00275 0.15055 26.588 < 0.0000000000000002 ***
## closeness -0.96824 0.55162 -1.755 0.079212 .
## pagerank -55.55192 21.43535 -2.592 0.009553 **
## eigen_centrality -0.34582 0.12934 -2.674 0.007500 **
## clustering 0.18170 0.04847 3.749 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2646.2 on 351 degrees of freedom
## Residual deviance: 2465.8 on 347 degrees of freedom
## AIC: 4338.3
##
## Number of Fisher Scoring iterations: 4
Now we test the predictive power of both models:
# Lasso prediction
X_test <- model.matrix(first_infected ~ degree + betweenness + closeness + pagerank +
eigen_centrality + clustering, data = test_data)[, -1]
lasso_preds <- predict(lasso_model, newx = X_test, s = "lambda.min")
# Poisson prediction
poisson_preds <- predict(poisson_model, newdata = test_data, type = "response")
# MSE comparison
mean((test_data$first_infected - lasso_preds)^2)
## [1] 246.3044
mean((test_data$first_infected - poisson_preds)^2)
## [1] 245.9152
The MSE of the Poisson model is smaller than that of the lasso model, so we keep the predictions from the Poisson model and save them into metrics_df.
7.2. (Based on step 3) Find the predicted 1% earliest infected nodes
Then, as in the step 3, we again try to find a better set of 1% of seeds in the network so we get more infected people than the random case.
metrics_df$predicted_time <- predict(poisson_model, newdata = metrics_df, type = "response")
num_seeds <- ceiling(0.01 * vcount(graph)) #set the percentage of 1%
best_seed_nodes <- metrics_df %>%
arrange(predicted_time) %>%
slice(1:num_seeds) %>% #select the nodes with earlier predicted infection time
pull(node)
head(best_seed_nodes)
## [1] 105 333  16  23  42 196
7.3. (Based on step 5) Compare the results of random or targeted vaccination
Then we replicate step 5, but seeding with the nodes selected by the Poisson model (targeted seeding); we again run 30 simulations for both random and PageRank vaccination.
set.seed(1234)
# Parameters
beta_7 <- beta_c
mu_7 <- 0.1
kind_n <- ceiling(0.05 * vcount(graph))
runs_7 <- 30
# Create result holder
results_7 <- data.frame(
strategy = character(),
total_infected = numeric(),
peak_time = numeric()
)
# --------- 7a: Random vaccination, fixed Poisson seeds ----------
for (i in 1:runs_7) {
kind_people_random <- sample(1:vcount(graph), kind_n)
graph_rand_7 <- delete_vertices(graph, kind_people_random)
valid_seeds_7_rand <- best_seed_nodes[best_seed_nodes %in% as.numeric(V(graph_rand_7))]
sim <- sim_sir(graph_rand_7, beta = beta_7, mu = mu_7, seeds = valid_seeds_7_rand)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min()
results_7 <- rbind(results_7, data.frame(
strategy = "Random_vaccination",
total_infected = total_inf,
peak_time = peak
))
}
# --------- 7b: PageRank vaccination, fixed Poisson seeds ----------
for (i in 1:runs_7) {
graph_pager_7 <- delete_vertices(graph, kind_people_pager)
valid_seeds_7_pager <- best_seed_nodes[best_seed_nodes %in% as.numeric(V(graph_pager_7))]
sim <- sim_sir(graph_pager_7, beta = beta_7, mu = mu_7, seeds = valid_seeds_7_pager)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min()
results_7 <- rbind(results_7, data.frame(
strategy = "PageRank_vaccination",
total_infected = total_inf,
peak_time = peak
))
}
# Summary
summary_7 <- results_7 %>%
group_by(strategy) %>%
summarise(
avg_infected = mean(total_infected),
sd_infected = sd(total_infected),
avg_peak = mean(peak_time),
sd_peak = sd(peak_time)
)
print(summary_7)
## # A tibble: 2 × 5
## strategy avg_infected sd_infected avg_peak sd_peak
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 PageRank_vaccination 27.5 10.3 0 0
## 2 Random_vaccination 53.8 26.6 0 0
# Visualisations
ggplot(results_7, aes(x = strategy, y = total_infected, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.6) +
labs(title = "Total Infected (30 runs): Poisson-seeds with different vaccination strategies")
ggplot(results_7, aes(x = strategy, y = peak_time, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.6) +
labs(title = "Peak Time (30 runs): Poisson-seeds with different vaccination strategies")
Here, the results look much like step 5: as we set beta at the critical value, both random and targeted (PageRank) vaccination have their peak time at t = 0. The total infected number is still significantly lower under targeted vaccination (27.47) than under random vaccination (53.80), and targeted vaccination also keeps the standard deviation lower (10.34 < 26.63).
However, the total infected numbers here (targeted seeding) are much higher than in step 5 (random seeding) for both vaccination strategies, PageRank (27.47 > 17.60) and random (53.80 > 28.17), and the standard deviations here are higher too.
7.4. Conclusion
Combining the insights from steps 5 and 7, we can conclude that, on the one hand, whichever vaccination strategy is used (targeted or random), seeding is more effective at spreading when it is targeted than when it is random. On the other hand, whichever way the initial seeds were chosen, targeted vaccination is always the more effective containment method.
The latter conclusion is consistent with the mathematical results we learned in class. Targeted vaccination naturally deletes the high-degree nodes, which significantly reduces \({\langle k^2 \rangle}\). This effect shows up in two places:
\[ R_0 = \frac{\beta}{\mu} \cdot \frac{\langle k^2 \rangle}{\langle k \rangle} \]
With a lower \({\langle k^2 \rangle}\), \(R_0\) is reduced, meaning a lower probability of spreading;
And \[ g_c \approx 1 - \frac{\mu \langle k \rangle_{k \leq k_M}}{\beta \langle k^2 \rangle_{k \leq k_M}} \]
With a lower \({\langle k^2 \rangle}\), the critical vaccination threshold \(g_c\) is reduced, meaning that a lower vaccination coverage is required to stop the spread.
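These two effects can be put into numbers with a small sketch (μ, β, ⟨k⟩ and the before/after ⟨k²⟩ values are assumptions for illustration, and ⟨k⟩ is held fixed for simplicity even though hub removal lowers it too):

```r
# Lower <k^2> lowers both R0 and the critical vaccination coverage g_c.
r0_fun <- function(beta, mu, k1, k2) (beta / mu) * (k2 / k1)
gc_fun <- function(beta, mu, k1, k2) 1 - (mu * k1) / (beta * k2)
mu <- 0.1; beta <- 0.05; k1 <- 9.6   # assumed illustrative values
c(R0_before = r0_fun(beta, mu, k1, 180), R0_after = r0_fun(beta, mu, k1, 90))
c(gc_before = gc_fun(beta, mu, k1, 180), gc_after = gc_fun(beta, mu, k1, 90))
```

Halving ⟨k²⟩ halves R0 here and drops the required coverage from about 89% to about 79%.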
Final conclusions
Throughout this project, we explored how rumors spread in a real-life university email network using the SIR model. We tried out different strategies—both for picking who starts the rumor and who we “vaccinate” (i.e., prevent from spreading it)—and found some interesting results:
First, we saw in action the epidemic threshold formula from class: \[ \beta_c = \mu \frac{\langle k \rangle}{\langle k^2 \rangle}\] When the infection rate β goes above this value, the rumor spreads like wildfire. Below it, nothing much happens. It was great to see this theory come to life in our simulation results.
When it comes to who starts the spread, using central nodes (like those with high PageRank or degree) clearly makes the rumor spread faster and more widely than picking random people. That matched what we expected from network theory.
For vaccination, targeting the most central people again made a huge difference—it stopped the spread much more effectively than random vaccination. The idea that removing hubs slows down the spread really worked here.
In short, this project helped us test a lot of the mathematical concepts from class using real data and real simulations. It was cool to see how theory translates into actual strategies that work—and to see just how much network structure matters when it comes to spreading anything, whether it’s gossip, viruses, or ideas.
6) Comment on the relationship between the findings in steps 3 and 5 using the same type of centrality for the 1% in step 3 and 5% in step 5.
Here we compare our 2 PageRank methods applied in Q3 and Q5B. To summarise, we did:
In Q3 - apply a very high beta value (15%) above the critical value to maximise our infection spread and used centrality measures to select the 1% of infections using PageRank.
In Q5 - we select 5% of the population using PageRank to essentially treat them as kind_people who will not spread any rumours.
To compare the outputs we first outline differences in the methods:
Fewer initial infected nodes in Q5. Due to sequencing, we first remove 5% of nodes and then apply the 1% infection (per the question), giving 10 initial infected nodes; in Q3 we take our 1% sample from the whole unaffected population, giving 12 initial infected nodes.
Q3 intends to maximise spread by infecting the most connected nodes, while Q5 intends to prevent spread by removing the most connected nodes.
Q3 specifies a high beta value (we select \(\beta = 0.15\)), while Q5 specifies the critical beta value (\(\beta_c = 0.00535\)). This means our infectivity is around 28 times higher in Q3 than in Q5.
Overall, these 2 questions seek very different things but both highlight the importance of central nodes when considering the spread of rumors in the university email system. These central nodes could be considered the super-spreaders at risk of creating a more rapid epidemic.
Combining the findings, we see that who is infected initially is highly predictive of whether information will spread or die out. The influential people in a social network hold great power: if they work together to spread a rumor over the university email, they will almost certainly succeed, while if they choose to completely disengage and be kind people, the rumor will not survive! Let's hope they are nice people.